{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classification\n",
    "Classification assigns each example to one of a given set of categories.\n",
    "\n",
    "### Examples of Classification Problems\n",
    "* Text Categorization (e.g. Spam Filtering)\n",
    "* Classification of Apples and Oranges\n",
    "* Fraud Detection\n",
    "* Face Detection\n",
    "* Optical Character Recognition\n",
    "* Natural Language Processing\n",
    "\n",
    "### Packages\n",
    "The main packages used in this project are:\n",
    "* scikit-learn (`sklearn`), for the classifiers and the accuracy metric\n",
    "* NumPy\n",
    "* matplotlib (for plotting)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 202,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import all the libraries here\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.gaussian_process import GaussianProcessClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier\n",
    "from sklearn.metrics import accuracy_score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Classifiers\n",
    "The classifiers used for this problem are:\n",
    "   * Decision Tree Classifier\n",
    "   * KNeighbors Classifier\n",
    "   * Gaussian Process Classifier\n",
    "   * Random Forest Classifier\n",
    "   * Ada-Boost Classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize all the classifiers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 203,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "decisionClf = DecisionTreeClassifier()\n",
    "knnClf = KNeighborsClassifier()\n",
    "gpcClf = GaussianProcessClassifier()\n",
    "# bootstrap=True (the default) fits each tree on a bootstrap sample of the training set\n",
    "rpcClf = RandomForestClassifier(bootstrap=True)\n",
    "adaBoostClf = AdaBoostClassifier()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Define the training and test data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 204,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Training data: [height, weight, shoe_size]\n",
    "X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],\n",
    "     [190, 90, 47], [175, 64, 39],\n",
    "     [177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]\n",
    "\n",
    "Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',\n",
    "     'female', 'male', 'male']\n",
    "\n",
    "# Test data: [height, weight, shoe_size]\n",
    "test_X = [[179, 90, 44], [190, 88, 44], [165, 55, 37], [160, 60, 39], [156, 56, 36],\n",
    "          [181, 85, 43], [174, 66, 40],\n",
    "          [177, 70, 43], [159, 66, 47], [188, 100, 44], [179, 84, 47]]\n",
    "\n",
    "test_Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',\n",
    "          'female', 'male', 'male']"
   ]
  },
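  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick look at the training data can help judge how well the two classes separate. The sketch below is an added illustration, not part of the original run: it plots height against weight for the training set defined above (repeated here so the block is self-contained). The choice of these two features and the names `male_points` and `female_points` are assumptions.\n",
    "\n",
    "```python\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Training data copied from the cell above: [height, weight, shoe_size]\n",
    "X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],\n",
    "     [190, 90, 47], [175, 64, 39],\n",
    "     [177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]\n",
    "Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',\n",
    "     'female', 'male', 'male']\n",
    "\n",
    "# Split the rows by label so that each class is drawn in its own color\n",
    "male_points = [row for row, label in zip(X, Y) if label == 'male']\n",
    "female_points = [row for row, label in zip(X, Y) if label == 'female']\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "ax.scatter([p[0] for p in male_points], [p[1] for p in male_points], label='male')\n",
    "ax.scatter([p[0] for p in female_points], [p[1] for p in female_points], label='female')\n",
    "ax.set_xlabel('height')\n",
    "ax.set_ylabel('weight')\n",
    "ax.legend()\n",
    "plt.show()\n",
    "```"
   ]
  },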
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### a) Decision Tree Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 205,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Decision Tree Classifier\n",
      "Score:  0.64 \n",
      "Mean accuracy: 0.64\n"
     ]
    }
   ],
   "source": [
    "decisionClf = decisionClf.fit(X, Y)\n",
    "prediction = decisionClf.predict(test_X)\n",
    "\n",
    "# accuracy_score and clf.score both return mean accuracy on the test set (1.0 is perfect)\n",
    "print('Decision Tree Classifier')\n",
    "print('Score:  %.2f ' % accuracy_score(test_Y, prediction))\n",
    "print('Mean accuracy: %.2f' % decisionClf.score(test_X, test_Y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### b) KNeighbors Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 206,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "KNeighbors Classifier\n",
      "Score:  0.73 \n",
      "Mean accuracy: 0.73\n"
     ]
    }
   ],
   "source": [
    "knnClf = knnClf.fit(X, Y)\n",
    "prediction = knnClf.predict(test_X)\n",
    "\n",
    "# accuracy_score and clf.score both return mean accuracy on the test set (1.0 is perfect)\n",
    "print('KNeighbors Classifier')\n",
    "print('Score:  %.2f ' % accuracy_score(test_Y, prediction))\n",
    "print('Mean accuracy: %.2f' % knnClf.score(test_X, test_Y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### c) Gaussian Process Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 207,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Gaussian Process Classifier\n",
      "Score:  0.73 \n",
      "Mean accuracy: 0.73\n"
     ]
    }
   ],
   "source": [
    "gpcClf = gpcClf.fit(X, Y)\n",
    "prediction = gpcClf.predict(test_X)\n",
    "\n",
    "# accuracy_score and clf.score both return mean accuracy on the test set (1.0 is perfect)\n",
    "print('Gaussian Process Classifier')\n",
    "print('Score:  %.2f ' % accuracy_score(test_Y, prediction))\n",
    "print('Mean accuracy: %.2f' % gpcClf.score(test_X, test_Y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### d) Random Forest Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 212,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Random Forest Classifier\n",
      "Score:  0.82 \n",
      "Mean accuracy: 0.82\n"
     ]
    }
   ],
   "source": [
    "rpcClf = rpcClf.fit(X, Y)\n",
    "prediction = rpcClf.predict(test_X)\n",
    "\n",
    "# accuracy_score and clf.score both return mean accuracy on the test set (1.0 is perfect)\n",
    "print('Random Forest Classifier')\n",
    "print('Score:  %.2f ' % accuracy_score(test_Y, prediction))\n",
    "print('Mean accuracy: %.2f' % rpcClf.score(test_X, test_Y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### e) Ada-Boost Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 209,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ada-Boost Classifier\n",
      "Score:  0.73 \n",
      "Mean accuracy: 0.73\n"
     ]
    }
   ],
   "source": [
    "adaBoostClf = adaBoostClf.fit(X, Y)\n",
    "prediction = adaBoostClf.predict(test_X)\n",
    "\n",
    "# accuracy_score and clf.score both return mean accuracy on the test set (1.0 is perfect)\n",
    "print('Ada-Boost Classifier')\n",
    "print('Score:  %.2f ' % accuracy_score(test_Y, prediction))\n",
    "print('Mean accuracy: %.2f' % adaBoostClf.score(test_X, test_Y))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 210,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['1' '1' '0' '0' '0' '1' '0' '1' '1' '1' '1']\n"
     ]
    }
   ],
   "source": [
    "# Scratch cell, not used elsewhere: encode the last predictions (Ada-Boost) as 1/0.\n",
    "# prediction is a NumPy string array, so the values are stored as '1' and '0'.\n",
    "maleIndex = [i for i in range(len(prediction)) if prediction[i] == 'male']\n",
    "femaleIndex = [i for i in range(len(prediction)) if prediction[i] == 'female']\n",
    "\n",
    "prediction[maleIndex] = 1\n",
    "prediction[femaleIndex] = 0\n",
    "\n",
    "print(prediction)"
   ]
  },
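  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The encoded predictions printed above can be checked by hand against `test_Y`. The sketch below is an added illustration: it copies that printed output and the test labels, counts the matches, and compares the result with `accuracy_score`. The names `encoded`, `predicted`, `matches` and `manual_accuracy` are assumptions.\n",
    "\n",
    "```python\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "# Test labels and the encoded Ada-Boost predictions, copied from the cells above\n",
    "test_Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',\n",
    "          'female', 'male', 'male']\n",
    "encoded = ['1', '1', '0', '0', '0', '1', '0', '1', '1', '1', '1']\n",
    "\n",
    "# Decode '1'/'0' back to the original labels\n",
    "predicted = ['male' if value == '1' else 'female' for value in encoded]\n",
    "\n",
    "# Accuracy is the fraction of predictions that equal the true label\n",
    "matches = sum(p == t for p, t in zip(predicted, test_Y))\n",
    "manual_accuracy = matches / len(test_Y)\n",
    "\n",
    "print('%d of %d correct: %.2f' % (matches, len(test_Y), manual_accuracy))\n",
    "print('accuracy_score:   %.2f' % accuracy_score(test_Y, predicted))\n",
    "```\n",
    "\n",
    "Eight of the eleven labels match, which is the 0.73 reported for the Ada-Boost Classifier."
   ]
  },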
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Results\n",
    "Scores from the run above, from highest to lowest. Both lines report mean accuracy on the 11 test examples, so they always agree.\n",
    "\n",
    "**Random Forest Classifier**\n",
    "```\n",
    "Score:  0.82 \n",
    "Mean accuracy: 0.82\n",
    "```\n",
    "\n",
    "**Ada-Boost Classifier**\n",
    "```\n",
    "Score:  0.73 \n",
    "Mean accuracy: 0.73\n",
    "```\n",
    "\n",
    "**Gaussian Process Classifier**\n",
    "```\n",
    "Score:  0.73 \n",
    "Mean accuracy: 0.73\n",
    "```\n",
    "\n",
    "**KNeighbors Classifier**\n",
    "```\n",
    "Score:  0.73 \n",
    "Mean accuracy: 0.73\n",
    "```\n",
    "\n",
    "**Decision Tree Classifier**\n",
    "```\n",
    "Score:  0.64 \n",
    "Mean accuracy: 0.64\n",
    "```"
   ]
  },
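  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The per-classifier cells above can also be folded into one loop that collects the scores in a dictionary. This is a minimal sketch rather than the code that produced the results listed here: it repeats the data so that it runs on its own, the names `classifiers` and `scores` are assumptions, and the scores of the tree-based models will vary between runs.\n",
    "\n",
    "```python\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.gaussian_process import GaussianProcessClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier\n",
    "\n",
    "# Data copied from the cells above: [height, weight, shoe_size]\n",
    "X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],\n",
    "     [190, 90, 47], [175, 64, 39],\n",
    "     [177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]\n",
    "Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',\n",
    "     'female', 'male', 'male']\n",
    "test_X = [[179, 90, 44], [190, 88, 44], [165, 55, 37], [160, 60, 39], [156, 56, 36],\n",
    "          [181, 85, 43], [174, 66, 40],\n",
    "          [177, 70, 43], [159, 66, 47], [188, 100, 44], [179, 84, 47]]\n",
    "test_Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',\n",
    "          'female', 'male', 'male']\n",
    "\n",
    "classifiers = {\n",
    "    'Decision Tree': DecisionTreeClassifier(),\n",
    "    'KNeighbors': KNeighborsClassifier(),\n",
    "    'Gaussian Process': GaussianProcessClassifier(),\n",
    "    'Random Forest': RandomForestClassifier(bootstrap=True),\n",
    "    'Ada-Boost': AdaBoostClassifier(),\n",
    "}\n",
    "\n",
    "# Fit each classifier on the training data and record its mean test accuracy\n",
    "scores = {}\n",
    "for name, clf in classifiers.items():\n",
    "    clf.fit(X, Y)\n",
    "    scores[name] = clf.score(test_X, test_Y)\n",
    "\n",
    "# Print the scores from highest to lowest\n",
    "for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):\n",
    "    print('%-18s %.2f' % (name, score))\n",
    "```"
   ]
  },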
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary\n",
    "Across repeated runs, the Random Forest Classifier scored 0.82 at best and 0.63 at worst. A random forest combines the predictions of many decision trees, each fit on a random bootstrap sample of the training data, to improve predictive accuracy and control over-fitting. Because that sampling is random and the training set has only 11 examples, the score changes from run to run.\n",
    "On the other hand, the other classifiers score about the same (0.73 for KNeighbors, Gaussian Process and Ada-Boost, and 0.64 for the Decision Tree in this run), so any of them would do for this example. The data here was written by hand; to get more accurate results, use more data, which lets these algorithms generalize better from what they learn.\n"
   ]
  },
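  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The run-to-run variation described in the summary can be made visible by refitting the forest with different seeds. The sketch below is an added illustration, not part of the original run: the helper `forest_score`, the seed range and `n_estimators=10` are assumptions, and the data is repeated so the block runs on its own. Fixing `random_state` makes a single run repeatable.\n",
    "\n",
    "```python\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "# Data copied from the cells above: [height, weight, shoe_size]\n",
    "X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],\n",
    "     [190, 90, 47], [175, 64, 39],\n",
    "     [177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]\n",
    "Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',\n",
    "     'female', 'male', 'male']\n",
    "test_X = [[179, 90, 44], [190, 88, 44], [165, 55, 37], [160, 60, 39], [156, 56, 36],\n",
    "          [181, 85, 43], [174, 66, 40],\n",
    "          [177, 70, 43], [159, 66, 47], [188, 100, 44], [179, 84, 47]]\n",
    "test_Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',\n",
    "          'female', 'male', 'male']\n",
    "\n",
    "\n",
    "def forest_score(seed):\n",
    "    # Hypothetical helper: fit a small forest with a fixed seed, return test accuracy\n",
    "    clf = RandomForestClassifier(n_estimators=10, bootstrap=True, random_state=seed)\n",
    "    clf.fit(X, Y)\n",
    "    return clf.score(test_X, test_Y)\n",
    "\n",
    "\n",
    "# Different seeds draw different bootstrap samples, so the scores may differ\n",
    "seed_scores = [forest_score(seed) for seed in range(10)]\n",
    "print('lowest: %.2f  highest: %.2f' % (min(seed_scores), max(seed_scores)))\n",
    "\n",
    "# The same seed reproduces the same forest and therefore the same score\n",
    "print('repeatable with a fixed seed:', forest_score(0) == forest_score(0))\n",
    "```"
   ]
  },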
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "* [Statistical classification](https://en.wikipedia.org/wiki/Statistical_classification)\n",
    "* [Machine Learning Algorithms for Classification](http://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf)\n",
    "* [Introduction - Learn Python for Data Science #1](https://www.youtube.com/watch?v=T5pRlIbr6gg&index=1&list=PL2-dafEMk2A6QKz1mrk1uIGfHkC1zZ6UU)\n",
    "* [Random Forest](https://en.wikipedia.org/wiki/Random_forest)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
